Data Robustness Engine


0) Imports and Setup

In [1]:
from robustness_checker import RobustnessChecker

import warnings
warnings.filterwarnings('ignore')

Dataset Selection and Configurations

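Run only one RobustnessChecker configuration, chosen according to the dataset you want to test.

You may change the date1 and date2 arguments, which set the start and end of the date range to consider. Each must have the format 'YYYY-MM-DD'; the configuration below instead passes 'min' and 'max', which, judging by the export paths in the output further down, resolve to the full date range of the dataset.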

In [2]:
rc = RobustnessChecker(
    data_set='eia-weekly-psr',
    frequency='frequency',
    use_freq_col=True,
    freq_mapper={
        '4-Week': 'weekly-friday',
        'Weekly': 'weekly-friday',
    },
    date1='min',
    date2='max',
)

1) Missing values

  • Check for NULL values (i.e. blank fields) in the dataset:
    • per column
    • per instrument
    • per date
    • in the frequency column (if applicable)
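The checks listed above can be approximated with plain pandas. The following is a minimal sketch on a toy long-format frame, not the RobustnessChecker implementation; the column names (instrument_id, date, frequency, value) follow those mentioned in this notebook, and the data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy long-format frame (illustrative data only).
df = pd.DataFrame({
    "instrument_id": ["A", "A", "B", "B", "C", "C"],
    "date": pd.to_datetime(["2024-09-20", "2024-09-27"] * 3),
    "frequency": ["Weekly"] * 6,
    "value": [1.0, np.nan, 2.0, 3.0, np.nan, np.nan],
})

# Number of nulls in each column.
nulls_per_column = df.isna().sum()

# Fraction of rows that contain at least one null.
rows_with_null = df.isna().any(axis=1).mean()

# Nulls in the 'value' column, counted per instrument and per date.
value_is_null = df["value"].isna()
nulls_per_instrument = value_is_null.groupby(df["instrument_id"]).sum()
nulls_per_date = value_is_null.groupby(df["date"]).sum()

# Nulls in the frequency column.
null_frequencies = int(df["frequency"].isna().sum())
```

Grouping the boolean null mask, rather than the frame itself, keeps the per-instrument and per-date counts symmetric and avoids relying on how groupby treats missing values.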
In [3]:
# Constructs a bar chart, showing the number of nulls by COLUMN.
rc.identify_missing_values_per_column();
===== 4-WEEK =====
In total, there were 702551 (5.32%) cells in the dataset that were null.
	- There are 447152 (50.82%) of rows that have at least one null.
	- There are 426 (97.26%) instruments that have at least one null.
Percentage of missing nulls for each column:
Exported all the columns with nulls:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/missing-fields/columns.csv
===== WEEKLY =====
In total, there were 866126 (4.49%) cells in the dataset that were null.
	- There are 607718 (47.25%) of rows that have at least one null.
	- There are 556 (89.25%) instruments that have at least one null.
Percentage of missing nulls for each column:
Exported all the columns with nulls:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/missing-fields/columns.csv
In [4]:
# Identifies the INSTRUMENTS that have nulls (in the 'value' column), generating a box plot for the number missing for each instrument.
rc.identify_missing_values_per_instrument();
===== 4-WEEK =====
Found 426 (97.26%) instruments with at least one null in the 'value' field.
Exported instruments with nulls (under 'value'):
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/missing-fields/instruments/value.csv
===== WEEKLY =====
Found 556 (89.25%) instruments with at least one null in the 'value' field.
Exported instruments with nulls (under 'value'):
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/missing-fields/instruments/value.csv
In [5]:
# Identifies the DATES that have nulls (in the 'value' column), generating a line plot showing how the number of nulls changes over time.
rc.identify_missing_values_per_date();
===== 4-WEEK =====
Found 2190 dates with at least one null in the 'value' field.
Exported dates with nulls (under 'value'):
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/missing-fields/dates/value.csv
===== WEEKLY =====
Found 2192 dates with at least one null in the 'value' field.
Exported dates with nulls (under 'value'):
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/missing-fields/dates/value.csv
In [6]:
# Identifies the rows that had nulls under the FREQUENCY column.
rc.identify_missing_frequencies();
GOOD! No null values found under the frequency column.

2) Consistency / Integrity

  • Check whether the general STRUCTURE of the dataset is FIXED/CONSISTENT (i.e. what one would expect):
    • at the set level (across the whole dataset), for:
      • inconsistent intervals between adjacent date values, given the frequency
      • irregular correspondence between instruments and unique metadata combinations
      • irregular number of instruments per date
      • irregular number of unique metadata combinations per date
    • at the instrument level (in individual instruments and their time series), for those which:
      • do not start at the minimum datetime found in the dataset
      • do not end at the maximum datetime found in the dataset
      • have inconsistent date intervals, given their frequency
      • have dates that fall outside their frequency
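As an illustration of the set-level date checks, one way to find absent and off-frequency dates is to build the expected date grid and compare it with the observed dates. This is a sketch under assumptions: the 'weekly-friday' frequency is taken to correspond to the pandas "W-FRI" alias, and the dates are toy values.

```python
import pandas as pd

# Observed weekly (Friday) dates, with 2024-09-13 deliberately left out.
observed = pd.DatetimeIndex(
    ["2024-08-30", "2024-09-06", "2024-09-20", "2024-09-27"]
)

# Expected Friday grid between the observed minimum and maximum.
expected = pd.date_range(observed.min(), observed.max(), freq="W-FRI")

# Dates on the grid that never appear in the data.
absent = expected.difference(observed)

# Dates in the data that do not fall on the grid.
off_grid = observed.difference(expected)

# Distribution of gaps (in days) between adjacent observed dates.
interval_counts = observed.to_series().diff().dt.days.value_counts()
```

On a perfectly regular weekly series, interval_counts would contain a single entry for 7 days; any other key points to a gap or an off-schedule date.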
In [7]:
# Identify the DATES that were absent when taking into account the frequency between the minimum and maximum times
rc.identify_missing_intervals();
===== 4-WEEK =====
3 dates/times were ABSENT from the expected interval (4-Week).
Exported the dates that were absent:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/absent-dates.csv
===== WEEKLY =====
6 dates/times were ABSENT from the expected interval (Weekly).
Exported the dates that were absent:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/absent-dates.csv
In [8]:
# Get an overview of the distribution of time intervals that exist.
rc.output_intervals();
===== 4-WEEK =====
Outputting the possible variety of intervals (measured in days) between adjacent dates: 
===== WEEKLY =====
Outputting the possible variety of intervals (measured in days) between adjacent dates: 
In [9]:
# Are there inconsistencies between the number of unique sets of metadata and the number of instruments?
rc.identify_inconsistent_metas_instruments();
===== 4-WEEK =====
GOOD! The number of instruments is perfectly consistent the number of metas (438).
===== WEEKLY =====
GOOD! The number of instruments is perfectly consistent the number of metas (623).
In [10]:
# Is there inconsistency in the number of instruments per date?
rc.identify_inconsistent_date_instrument_counts();
===== 4-WEEK =====
Exported dates with missing instruments:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/dates-missing-instruments.csv
===== WEEKLY =====
Exported dates with missing instruments:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/dates-missing-instruments.csv
In [11]:
# Is there inconsistency in the number of unique sets of metadata per date?
# (It will be exactly the same as above if the instruments are consistent with the metadata sets)
rc.identify_inconsistent_date_meta_counts();
===== 4-WEEK =====
Exported dates with missing meta combinations:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/dates-missing-metas.csv
===== WEEKLY =====
Exported dates with missing meta combinations:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/dates-missing-metas.csv
In [12]:
# At the instrument level, are there inconsistencies in the time series?
rc.identify_inconsistent_instrument_series();
===== 4-WEEK =====
There exists 438 (100.00%) instruments with at least one inconsistency:

	-   60 ( 13.70%) instruments had their EARLIEST date not equal to the MINIMUM date.

	-    1 (  0.23%) instruments had their LATEST date not equal to the MAXIMUM date.

	-  378 ( 86.30%) instruments were MISSING at least one date between the end points of their time series.

Exported instruments without the set-level minimum date:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/instruments-late-start.csv
Exported instruments without the set-level maximum date:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/instruments-early-end.csv
Exported instruments with missing intervals:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/instruments-gaps.csv
Exported irregularity matrix for each instrument:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/inconsistencies/instruments-irregular.csv
===== WEEKLY =====
There exists 623 (100.00%) instruments with at least one inconsistency:

	-   68 ( 10.91%) instruments had their EARLIEST date not equal to the MINIMUM date.

	-    1 (  0.16%) instruments had their LATEST date not equal to the MAXIMUM date.

	-  563 ( 90.37%) instruments were MISSING at least one date between the end points of their time series.

Exported instruments without the set-level minimum date:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/instruments-late-start.csv
Exported instruments without the set-level maximum date:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/instruments-early-end.csv
Exported instruments with missing intervals:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/instruments-gaps.csv
Exported irregularity matrix for each instrument:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/inconsistencies/instruments-irregular.csv
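The three instrument-level categories reported above (late start, early end, gaps between end points) can be reproduced on a small scale as follows. This is a hedged sketch rather than the library's logic: the frame is toy data, and a weekly Friday grid ("W-FRI") is assumed for every instrument.

```python
import pandas as pd

df = pd.DataFrame({
    "instrument_id": ["A"] * 4 + ["B"] * 2 + ["C"] * 2,
    "date": pd.to_datetime([
        "2024-09-06", "2024-09-13", "2024-09-20", "2024-09-27",  # A: complete
        "2024-09-13", "2024-09-27",                              # B: late start, one gap
        "2024-09-06", "2024-09-13",                              # C: early end
    ]),
})

# Set-level end points that every instrument is compared against.
set_min, set_max = df["date"].min(), df["date"].max()

rows = {}
for instrument_id, dates in df.groupby("instrument_id")["date"]:
    # Expected grid between this instrument's own first and last dates.
    expected = pd.date_range(dates.min(), dates.max(), freq="W-FRI")
    rows[instrument_id] = {
        "late_start": bool(dates.min() > set_min),
        "early_end": bool(dates.max() < set_max),
        "n_gaps": len(expected.difference(pd.DatetimeIndex(dates))),
    }

# One row per instrument, one column per irregularity.
report = pd.DataFrame.from_dict(rows, orient="index")
```

Gaps are measured against each instrument's own end points, so an instrument that starts late is not also penalised for the dates before its first observation.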

3) Duplicates

  • Check whether there are any DUPLICATES in the data, in terms of the following subsets of the dataset:
    • the instrument_id and date columns.
    • the meta-data columns and the date columns.
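Both duplicate checks reduce to looking for repeated key combinations. A minimal pandas sketch is shown below; the 'region' column is a hypothetical stand-in for the dataset's real metadata columns, which this notebook does not list.

```python
import pandas as pd

df = pd.DataFrame({
    "instrument_id": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2024-09-20", "2024-09-20", "2024-09-20", "2024-09-27"]
    ),
    "region": ["US", "US", "EU", "EU"],  # hypothetical metadata column
    "value": [1.0, 1.5, 2.0, 2.5],
})

# Rows sharing an (instrument_id, date) pair; keep=False marks every copy.
dup_instrument_dates = df[
    df.duplicated(subset=["instrument_id", "date"], keep=False)
]

# Rows sharing a (metadata columns, date) combination.
meta_cols = ["region"]
dup_meta_dates = df[df.duplicated(subset=meta_cols + ["date"], keep=False)]
```

Using keep=False returns all members of each duplicate group, which is usually more useful for inspection than flagging only the second and later occurrences.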
In [13]:
# Are there any duplicate (instrument_id, date) pairs?
rc.identify_duplicate_instrument_dates();
===== 4-WEEK =====
GREAT! No duplicate instrument-date pairs found!
===== WEEKLY =====
GREAT! No duplicate instrument-date pairs found!
In [14]:
# Are there any duplicate (meta-data-columns, date) combinations?
rc.identify_duplicate_meta_dates();
===== 4-WEEK =====
GREAT! No duplicate meta-date pairs found!
===== WEEKLY =====
GREAT! No duplicate meta-date pairs found!

4) Outliers (experimental)

  • Check, per instrument ID, for OUTLIERS in the data, based on the z-score of each value
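The notebook does not state how the z-score is computed. One plausible reading, given the threshold and w arguments used in the cell below, is a rolling z-score per instrument; the sketch below assumes that reading and additionally assumes the mean and standard deviation come from the w observations preceding each value. The function name and data are illustrative only.

```python
import pandas as pd

def rolling_zscore_outliers(values: pd.Series, threshold: float, w: int) -> pd.Series:
    """Flag values whose z-score, relative to the previous w values, exceeds threshold."""
    # shift(1) excludes the current value from its own reference window.
    prior = values.shift(1).rolling(window=w, min_periods=w)
    z = (values - prior.mean()) / prior.std()
    # The first w values have no full window; their z-score is NaN and they are not flagged.
    return z.abs() > threshold

df = pd.DataFrame({
    "instrument_id": ["A"] * 23,
    "value": [10.0, 10.1, 9.9, 10.2, 9.8] * 4 + [50.0, 10.0, 10.1],
})

# Apply the check separately to each instrument, then realign on the index.
flags = [
    rolling_zscore_outliers(group, threshold=10.0, w=5)
    for _, group in df.groupby("instrument_id")["value"]
]
df["is_outlier"] = pd.concat(flags)
```

Excluding the current value from the window matters for high thresholds: if a value is included in its own window of n points, its z-score cannot exceed (n - 1) / sqrt(n), which is about 3.6 for a window of 15.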
In [15]:
# Flag values whose z-score exceeds the threshold (w appears to be the window size, per the export filename)
rc.identify_outliers(threshold=10.0, w=15);
===== 4-WEEK =====
768 outliers detected in total!
411 unique dates had the outliers.
134 instruments (30.59%) had the outliers.
Exported (instrument_id, date) outliers:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/outliers/value-outliersthreshold-10.0_window-15.csv
===== WEEKLY =====
1211 outliers detected in total!
569 unique dates had the outliers.
233 instruments (37.40%) had the outliers.
Exported (instrument_id, date) outliers:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/outliers/value-outliersthreshold-10.0_window-15.csv

5) Data Freshness

  • Check per instrument ID whether the time series is up to date (according to their frequency and today's date)
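A freshness check of this kind can be sketched as comparing each instrument's latest date with the most recent date the frequency would be expected to produce. The example below assumes a weekly Friday schedule and uses toy data; how RobustnessChecker actually defines "up to date" is not shown in this notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "instrument_id": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2024-09-20", "2024-09-27", "2024-09-27", "2024-10-04"]
    ),
})

as_of = pd.Timestamp("2024-10-04")

# Most recent Friday on or before as_of (Monday=0, so Friday=4).
latest_expected = as_of - pd.Timedelta(days=(as_of.weekday() - 4) % 7)

# Latest observation per instrument, and those lagging the expected date.
last_seen = df.groupby("instrument_id")["date"].max()
stale = last_seen[last_seen < latest_expected]
```

Under this definition, a dataset whose last date is one week before the as-of date would be reported as entirely stale, whatever the state of the individual series.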
In [16]:
rc.identify_date_freshness(as_of_date="today");
===== 4-WEEK =====
Found 438 instruments (100.00%) with time series NOT up to date (as of 2024-10-04 00:00:00)!
Exported instruments with time series not up to date:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/4-Week/data-freshness/unfresh-instruments-2024-10-04_00,00.csv
===== WEEKLY =====
Found 623 instruments (100.00%) with time series NOT up to date (as of 2024-10-04 00:00:00)!
Exported instruments with time series not up to date:
	_exports/source/eia-weekly-psr/1982-08-20_to_2024-09-27/Weekly/data-freshness/unfresh-instruments-2024-10-04_00,00.csv

6) Plotting an instrument

In [17]:
# rc.run_app()

7) Generating a report

In [18]:
rc.generate_report();
********************************************************************************
*                                                                              *
*                           MISSING VALUES TESTS (M)                           *
*                                                                              *
********************************************************************************
================================================================================
TEST M0: Does the frequency column have any nulls?
	PASS: No nulls found in the 'frequency' column!
================================================================================
TEST M1: Do any fields have nulls?
----4-WEEK----
		FAIL: Null values found in 9 fields in total!
----WEEKLY----
		FAIL: Null values found in 9 fields in total!
(Refer to the 'identify_missing_values_per_column' function for more details)
================================================================================
TEST M2: Do any instruments have nulls in the 'value' field?
----4-WEEK----
		FAIL: Null values found in 426 (97.26%) instruments!
----WEEKLY----
		FAIL: Null values found in 556 (89.25%) instruments!
(Refer to the 'identify_missing_values_per_instrument' function for more details)
================================================================================
TEST M3: Do any dates/times have null values in the 'value' field?
----4-WEEK----
		FAIL: Null values found in 2190 dates!
----WEEKLY----
		FAIL: Null values found in 2192 dates!
(Refer to the 'identify_missing_values_per_date' function for more details)
********************************************************************************
*                                                                              *
*                            CONSISTENCY TESTS (C)                             *
*                                                                              *
********************************************************************************
================================================================================
TEST C1: Are they any dates/times or intervals which are inconsistent with the expected frequency?
----4-WEEK----
		FAIL: Inconsistent dates/times and intervals exist!
			- 3 dates/times were ABSENT from the expected frequency.
			- 0 dates/times exist that are OUTSIDE the expected interval.
----WEEKLY----
		FAIL: Inconsistent dates/times and intervals exist!
			- 6 dates/times were ABSENT from the expected frequency.
			- 0 dates/times exist that are OUTSIDE the expected interval.
(Refer to the 'identify_missing_intervals' function for more details)
================================================================================
TEST C2: Is there a one-to-one correspondence between instruments and unique metadata combinations?
----4-WEEK----
		PASS: The number of instruments is perfectly consistent the number of metas.
----WEEKLY----
		PASS: The number of instruments is perfectly consistent the number of metas.
================================================================================
TEST C3a: Do the dates in the dataset have the full set of instruments assigned to each?
----4-WEEK----
		FAIL: 1443 dates were found to have at least one instrument that was NOT assigned to it.
----WEEKLY----
		FAIL: 1445 dates were found to have at least one instrument that was NOT assigned to it.
(Refer to the 'identify_inconsistent_date_instrument_counts' function for more details)
================================================================================
TEST C3b: Do the dates in the dataset have the full set of unique metadata combinations assigned to each?
	(If C2 was successful, the results of this test will be the same as C3a.)
----4-WEEK----
		FAIL: 1443 dates were found to have at least one metadata combination that was NOT assigned to it.
----WEEKLY----
		FAIL: 1445 dates were found to have at least one metadata combination that was NOT assigned to it.
(Refer to the 'identify_inconsistent_date_meta_counts' function for more details)
================================================================================
TEST C4: Do any individual instruments have inconsistencies in terms of their time series dates?
----4-WEEK----
		FAIL: 438 (100.00%) instruments were found with an inconsistency:
			-   60 ( 13.70%) instruments had their EARLIEST date not equal to the MINIMUM date.
			-    1 (  0.23%) instruments had their LATEST date not equal to the MAXIMUM date.
			-  378 ( 86.30%) instruments were MISSING at least one date between the end points of their time series.
			-    0 (  0.00%) instruments had at least one date NOT IN LINE with the expected frequency of the time series.
----WEEKLY----
		FAIL: 623 (100.00%) instruments were found with an inconsistency:
			-   68 ( 10.91%) instruments had their EARLIEST date not equal to the MINIMUM date.
			-    1 (  0.16%) instruments had their LATEST date not equal to the MAXIMUM date.
			-  563 ( 90.37%) instruments were MISSING at least one date between the end points of their time series.
			-    0 (  0.00%) instruments had at least one date NOT IN LINE with the expected frequency of the time series.
(Refer to the 'identify_inconsistent_instrument_series' function for more details)
********************************************************************************
*                                                                              *
*                             DUPLICATES TESTS (D)                             *
*                                                                              *
********************************************************************************
================================================================================
TEST D1a: Does there exist any duplicate (instrument_id, date) combinations among the rows of the dataset?
----4-WEEK----
		PASS: No duplicate instrument-date pairs found.
----WEEKLY----
		PASS: No duplicate instrument-date pairs found.
================================================================================
TEST D1b: Does there exist any duplicate metadata-date combinations among the rows of the dataset?
----4-WEEK----
		PASS: No duplicate metadata-date pairs found.
----WEEKLY----
		PASS: No duplicate metadata-date pairs found.